
Comments

Parallel manager2 #1

Open
lolyu wants to merge 43 commits into master from parallel_manager2

Conversation

@lolyu
Owner

@lolyu lolyu commented Feb 2, 2026

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@lolyu lolyu closed this Feb 2, 2026
@lolyu lolyu reopened this Feb 2, 2026
@lolyu
Owner Author

lolyu commented Feb 2, 2026

@greptile

@lolyu lolyu closed this Feb 2, 2026
@lolyu lolyu reopened this Feb 2, 2026
@greptile-apps

greptile-apps bot commented Feb 2, 2026

Greptile Overview

Greptile Summary

This PR introduces a parallel fixture execution framework to optimize test execution time by parallelizing fixture setup/teardown operations. The implementation includes:

  • Parallel Fixture Manager: A new pytest plugin (parallel_fixture) that uses ThreadPoolExecutor to execute fixture setup/teardown tasks concurrently across different scopes (SESSION, MODULE, CLASS, FUNCTION)
  • Scope-based barriers: Automatic synchronization points that ensure all parallel tasks complete before proceeding to the next test scope
  • Deadlock prevention: Fork-safety mechanisms using os.register_at_fork to prevent logging and Ansible display deadlocks in multi-threaded/multi-process environments
  • Closure bug fixes: Corrected several __getattr__ implementations in base.py, duthosts.py, and multi_asic.py where module names were incorrectly captured by reference instead of by value
  • Docker capability: Added SYS_PTRACE capability to enable thread interruption for timeout handling
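The submit-and-barrier flow described above can be sketched as a toy model. The class and method names below (MiniParallelManager, submit_setup_task, wait_for_setup_tasks) mirror the summary but are illustrative, not the actual plugin code:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait

class MiniParallelManager:
    """Toy model: fixtures submit work per scope, a barrier joins each scope."""
    def __init__(self, workers=4):
        self.executor = ThreadPoolExecutor(max_workers=workers)
        self.futures = defaultdict(list)          # scope -> [Future, ...]

    def submit_setup_task(self, scope, fn, *args):
        # A fixture submits its setup and yields immediately instead of blocking.
        self.futures[scope].append(self.executor.submit(fn, *args))

    def wait_for_setup_tasks(self, scope):
        # Scope barrier: block until every task in this scope finishes;
        # Future.result() re-raises any setup exception for pytest to see.
        done, _ = wait(self.futures.pop(scope, []))
        return [f.result() for f in done]

mgr = MiniParallelManager()
for dut in ("dut-1", "dut-2", "dut-3"):
    mgr.submit_setup_task("SESSION", lambda name=dut: f"{name} ready")
print(sorted(mgr.wait_for_setup_tasks("SESSION")))
# ['dut-1 ready', 'dut-2 ready', 'dut-3 ready']
```

The key property is that fixture bodies return immediately after submitting, so independent setups for different DUTs overlap in time; the barrier fixture is the only place that blocks.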

The architecture is well-documented and addresses real concurrency challenges in the SONiC test infrastructure. However, there are several logical issues that need attention:

  • Missing return value check in PyThreadState_SetAsyncExc could cause silent failures when interrupting threads
  • Exception re-raising loses original traceback context, making debugging harder
  • The reset() method doesn't clear the terminated flag, potentially preventing the manager from being reused
  • Potential AttributeError when checking handler.lock that could be None

Confidence Score: 3/5

  • This PR has logical issues that could cause runtime failures in edge cases but is generally sound in design
  • The PR implements a complex parallel execution framework with good architecture and documentation. However, several critical logical issues were found: (1) missing return value checks for thread injection API that could silently fail, (2) exception handling that loses traceback context, (3) the reset mechanism doesn't clear the terminated flag which breaks reusability, and (4) potential None reference on handler.lock. These issues should be addressed before merging to ensure reliability in production.
  • tests/common/plugins/parallel_fixture/__init__.py and tests/common/helpers/parallel.py need attention for the logical issues mentioned

Important Files Changed

Filename Overview
tests/common/devices/base.py Fixed closure bug where module_name was captured incorrectly by returning a wrapper function
tests/common/devices/duthosts.py Fixed closure bug and added explicit initialization of _nodes attributes to prevent __getattr__ recursion
tests/common/devices/multi_asic.py Fixed closure bug where multi_asic_attr was captured incorrectly by returning a wrapper function
tests/common/helpers/parallel.py Added deadlock prevention for fork operations with logging locks, patched Ansible worker processes
tests/common/plugins/parallel_fixture/__init__.py Implemented parallel fixture execution framework with thread pool, scope-based barriers, and exception handling
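The closure bug fixed in base.py, duthosts.py, and multi_asic.py is Python's classic late-binding pitfall; a minimal illustration with generic names (not the actual device code):

```python
# Buggy: the lambda looks up `name` when it is CALLED, so every accessor
# ends up seeing the last value of the loop variable (capture by reference).
def make_accessors_buggy(names):
    return {name: (lambda: name) for name in names}

# Fixed: bind the current value with a default argument (capture by value),
# the same idea as returning a wrapper function from __getattr__.
def make_accessors_fixed(names):
    return {name: (lambda name=name: name) for name in names}

buggy = make_accessors_buggy(["eth0", "eth1"])
fixed = make_accessors_fixed(["eth0", "eth1"])
print(buggy["eth0"](), fixed["eth0"]())  # eth1 eth0
```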

Sequence Diagram

```mermaid
sequenceDiagram
    participant Pytest
    participant ParallelManager
    participant ThreadPool
    participant MonitorThread
    participant Fixture
    participant Barrier

    Note over Pytest,Barrier: Session Setup Phase
    Pytest->>ParallelManager: Create (session scope)
    ParallelManager->>ThreadPool: Initialize worker threads
    ParallelManager->>MonitorThread: Start monitoring

    Pytest->>Fixture: Execute fixture (session scope)
    Fixture->>ParallelManager: submit_setup_task(SESSION, func)
    ParallelManager->>ThreadPool: Submit task to thread pool
    ThreadPool-->>ParallelManager: Return future
    Fixture-->>Pytest: Yield immediately

    Pytest->>Barrier: setup_barrier_session
    Barrier->>ParallelManager: wait_for_setup_tasks(SESSION)
    ParallelManager->>ThreadPool: Wait for all session tasks
    ThreadPool-->>ParallelManager: Tasks complete

    Note over Pytest,Barrier: Module/Class/Function Scopes
    Pytest->>Fixture: Execute fixture (module scope)
    Fixture->>ParallelManager: submit_setup_task(MODULE, func)
    ParallelManager->>ThreadPool: Submit to pool
    Fixture-->>Pytest: Yield

    Pytest->>Barrier: setup_barrier_module
    Barrier->>ParallelManager: wait_for_setup_tasks(MODULE)
    ParallelManager->>ThreadPool: Wait for module tasks

    Note over Pytest,Barrier: Test Execution
    Pytest->>ParallelManager: pytest_runtest_call hook
    ParallelManager->>ThreadPool: Ensure all tasks complete
    ParallelManager->>ThreadPool: Terminate executor
    Pytest->>Pytest: Run test function

    Note over Pytest,Barrier: Teardown Phase
    Pytest->>ParallelManager: pytest_runtest_teardown hook
    ParallelManager->>ThreadPool: Reset and create new executor
    ParallelManager->>MonitorThread: Restart monitoring

    Pytest->>Fixture: Teardown fixture
    Fixture->>ParallelManager: submit_teardown_task(scope, func)
    ParallelManager->>ThreadPool: Submit teardown task
    Fixture-->>Pytest: Return

    Pytest->>Barrier: teardown_barrier_function
    Barrier->>ParallelManager: wait_for_teardown_tasks(FUNCTION)
    ParallelManager->>ThreadPool: Wait for function teardowns

    Pytest->>ParallelManager: pytest_runtest_logreport hook
    ParallelManager->>ThreadPool: Terminate executor
    ParallelManager->>MonitorThread: Stop monitoring
```


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 5 comments


Comment on lines +38 to +44
```python
def raise_async_exception(tid, exc_type):
    """Injects an exception into the specified thread."""
    if not isinstance(tid, int):
        raise TypeError("Thread ID must be an integer")

    ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                               ctypes.py_object(exc_type))
```

The PyThreadState_SetAsyncExc call doesn't check the return value. According to Python docs, it returns the number of thread states modified. If it returns 0, the thread ID is invalid. If it returns >1, multiple thread states were affected (shouldn't happen but indicates a serious issue).

Suggested change (tests/common/plugins/parallel_fixture/__init__.py, lines 38-44):

```python
def raise_async_exception(tid, exc_type):
    """Injects an exception into the specified thread."""
    if not isinstance(tid, int):
        raise TypeError("Thread ID must be an integer")

    ret = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                                     ctypes.py_object(exc_type))
    if ret == 0:
        logging.warning("Thread ID %s not found when injecting exception", tid)
    elif ret > 1:
        # Revert if we affected multiple threads
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid), None)
        raise SystemError(f"PyThreadState_SetAsyncExc affected {ret} threads")
```

```python
try:
    return func(*args, **kwargs)
except Exception:
    raise sys.exc_info()[0](traceback.format_exc())
```

The exception re-raising logic uses deprecated sys.exc_info()[0] and loses the original exception chain. This makes debugging harder and may not properly preserve the exception context.

Suggested change (tests/common/plugins/parallel_fixture/__init__.py, line 266):

```python
raise
```

Comment on lines +393 to +394
```python
self.active_futures.clear()
self.done_futures.clear()
```

Clearing the futures before termination check creates a race condition. If terminate() hasn't been called yet, clearing these sets will lose track of active tasks. The terminated flag should be checked before clearing.

Suggested change (tests/common/plugins/parallel_fixture/__init__.py, lines 393-394):

```python
if not self.terminated:
    raise RuntimeError("Cannot reset a running parallel fixture manager.")

self.active_futures.clear()
self.done_futures.clear()
```

Comment on lines 388 to 398
```python
def reset(self):
    """Reset the parallel fixture manager."""
    if not self.terminated:
        raise RuntimeError("Cannot reset a running parallel fixture manager.")

    self.active_futures.clear()
    self.done_futures.clear()
    self.executor = ThreadPoolExecutor(max_workers=self.worker_count)
    self.is_monitor_running = True
    self.monitor_thread = threading.Thread(target=self._monitor_workers, daemon=True)
    self.monitor_thread.start()
```

reset() doesn't set self.terminated = False after resetting the executor. This means once terminated, the manager can never be used again because terminate() will immediately return at line 335 due to the terminated check. Either reset should set terminated = False, or the design should clarify if reset is only for internal reuse.


@lolyu lolyu force-pushed the parallel_manager2 branch 7 times, most recently from 8ab4754 to e1bc7a9 on February 5, 2026 06:09
hdwhdw and others added 16 commits February 6, 2026 08:43
Replace shell-based openssl certificate generation with Python
cryptography library. Certificates are now backdated by 1 day
to handle clock skew between test host, DUT, and PTF container.

This fixes the error:
"tls: failed to verify certificate: x509: certificate has expired
or is not yet valid: current time ... is before ..."

Changes:
- Add tests/common/cert_utils.py with TlsCertificateGenerator class
- Add create_gnmi_cert_generator() factory for gNMI/gNOI naming
- Update grpc_fixtures.py to use Python-based cert generation
- Remove localhost fixture dependency from setup_gnoi_tls_server

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
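The effect of the 1-day backdating can be illustrated with a small validity-window check. This is a pure-datetime sketch of the clock-skew reasoning; the actual change builds real certificates with the Python cryptography library:

```python
from datetime import datetime, timedelta, timezone

BACKDATE = timedelta(days=1)  # the 1-day backdating from the commit message

def cert_validity_window(issue_time, lifetime=timedelta(days=365)):
    """Backdate notBefore so peers with skewed clocks still accept the cert."""
    return issue_time - BACKDATE, issue_time + lifetime

def is_valid_for_peer(peer_clock, window):
    not_before, not_after = window
    return not_before <= peer_clock <= not_after

issue = datetime(2026, 2, 6, 12, 0, tzinfo=timezone.utc)
window = cert_validity_window(issue)
# A PTF container whose clock lags 20 hours behind still accepts the cert:
print(is_valid_for_peer(issue - timedelta(hours=20), window))  # True
# Without backdating (notBefore == issue time) it would be rejected:
print(issue - timedelta(hours=20) >= issue)  # False
```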
What is the motivation for this PR?
Test DUTs with 64 ports connected to peers.

How did you do it?
How did you verify/test it?
Prepare the DUT and verified bgp/test_bgp_fact passes.

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

* Updated with latest tested topology

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

* Fix end of file static checker

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

* Change neighbor VM name from T1 to LT2

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

---------

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>
…sonic-net#21990)

What is the motivation for this PR?
Running many individual SSH commands in test scripts resulted in excessive overhead due to repeated SSH session establishment, leading to unnecessarily long test times. The total test time was 1 hr 26 mins: the current implementation makes one SSH call per port per queue, so for 450 ports with 7 UC queues each that is roughly 3150 SSH commands. Assuming each SSH call takes 1 to 1.5 seconds, about 1 hr 20 mins of the run is spent fetching this data.

The motivation is to improve test efficiency and make large test runs more practical by reducing SSH connection overhead.

How did you do it?
Identified places in the test where SSH commands are executed in sequence or within loops.
Rewrote those sections to batch multiple commands together and execute them through a single SSH session, fetching show queue counters for all ports and queue together.
Ensured that command outputs and error handling remained compatible with existing test validations.

How did you verify/test it?
Ran affected test scripts before and after the changes to compare results and execution time.
Verified that all expected outputs and pass/fail criteria remain the same.
Monitored for regressions, failures, or unexpected test behaviors possibly introduced by batching commands.
The execution time dropped from 1 hr 26 mins to 3 mins.

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
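The batching idea can be sketched as joining the per-port commands into one compound command executed over a single session. The helper below is hypothetical and is shown with a fake SSH runner in place of the real device connection:

```python
def fetch_queue_counters_batched(run_cmd, ports):
    """Join per-port commands into one compound command for a single SSH session."""
    combined = " ; ".join(f"show queue counters {port}" for port in ports)
    return run_cmd(combined)   # one session instead of one per (port, queue)

# Recording stub standing in for an SSH command runner:
calls = []
def fake_ssh(cmd):
    calls.append(cmd)
    return "ok"

fetch_queue_counters_batched(fake_ssh, [f"Ethernet{i}" for i in range(450)])
print(len(calls))  # 1 SSH invocation instead of ~3150
```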
)

Summary:
Replace fixed sleeps with polling and reduce wait times:

Add polling helpers: wait_for_crm_counter_update(), wait_for_resource_stabilization()
Replace 50s resource waits with adaptive polling
Reduce config waits from 10s to 5s (CONFIG_UPDATE_TIME)
Reduce cleanup wait from 50s to 20s (SONIC_RES_CLEANUP_UPDATE_TIME)

Signed-off-by: AntonHryshchuk <antonh@nvidia.com>
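The polling helpers replace fixed sleeps with a wait-until loop. A generic sketch of that pattern follows; the real helpers (wait_for_crm_counter_update(), wait_for_resource_stabilization()) poll DUT state rather than this stand-in condition:

```python
import time

def wait_until(timeout, interval, condition, *args):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

# Stand-in for polling a CRM counter on the DUT; becomes true on the 3rd poll.
counter = {"polls": 0}
def crm_counter_updated():
    counter["polls"] += 1
    return counter["polls"] >= 3

print(wait_until(5, 0.01, crm_counter_updated))  # True
```

The win over a fixed 50-second sleep is that the wait ends as soon as the condition holds, while the timeout still bounds the worst case.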
…le topos. (sonic-net#22169)

What is the motivation for this PR?
The generic hash test run time depends on the number of egress ports in the topology. When running on the scale topology like t0-isolated-d256u256s2, it takes more than 6 hours. This change is to optimize the runtime. The regular topologies also benefit from this.

How did you do it?
Decrease the total number of sent packets based on the number of egress ports, and relax the threshold for the balancing check accordingly.
Skip the LAG-related cases on topologies which have no LAGs.
Skip some test cases when there are not enough test ports or the variations in the tested field are too few. For example, when the hash field is IP_PROTOCOL there are only 256 possible values, which is not enough to hash to all 256 next hops perfectly.
Improve the test result output.
Add an ASIC DB route check in the reboot test case. After reboot/reload it takes time for the default route to be installed in the ASIC, so checking only the kernel route is not enough on scale topos.

How did you verify/test it?
Run the test on SN5640 with t0-isolated-d256u256s2 topo. The total runtime is about 1.5 ~ 2 hours now.

Signed-off-by: Cong Hou <congh@nvidia.com>
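The scaling described above (fewer packets and a looser balance threshold as the fan-out grows) might look roughly like this; the numbers and the scaling rule are hypothetical illustrations, not the PR's actual formula:

```python
def scaled_traffic_params(n_egress_ports, pkts_per_port=60, base_tolerance=0.25):
    """Scale total packets with the fan-out; widen the per-port balance tolerance."""
    total_pkts = n_egress_ports * pkts_per_port
    expected_per_port = total_pkts / n_egress_ports
    # Hypothetical rule: the more egress ports, the looser the threshold.
    tolerance = min(0.9, base_tolerance * (1 + n_egress_ports / 256))
    return total_pkts, expected_per_port, tolerance

def is_balanced(counts, expected, tolerance):
    return all(abs(c - expected) <= expected * tolerance for c in counts)

total, expected, tol = scaled_traffic_params(256)
print(total, expected, round(tol, 2))               # 15360 60.0 0.5
print(is_balanced([58, 62, 45, 75], expected, tol))  # True
```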
…onic-net#22287)

Description of PR
Changed container_upgrade tests to support restapi containers (restapi, restapi_watchdog, and restapi_sidecar).

Summary:
Microsoft ADO ID: 36672179

Approach
What is the motivation for this PR?
Adding support for restapi containers in container_upgrade tests so that we can run restapi tests using the same container images on different OS versions.

How did you do it?
Added instructions for creating restapi containers and running restapi tests in container_upgrade.

How did you verify/test it?
Tested on a T1 switch running 202505. The restapi tests passed.
What is the motivation for this PR?
Fix test_bgp_suppress_fib.py flakiness on scale topos. Extension of BULK_TRAFFIC_WAIT_TIME and ignoring malformed BGP route packets allows these tests to pass consistently on Arista t1 isolated DUTs

How did you do it?
How did you verify/test it?
Tested on T0 and T1 isolated Arista DUTs
* Fix AttributeError in conftest.py

In the ptfhosts pytest fixture in conftest.py, a method "_hosts.apend"
appears, which is a typo corrected by this change to "_hosts.append".

Signed-off-by: Christopher Croy <ccroy@arista.com>

* Fix flake8 lint indentation error

Signed-off-by: Christopher Croy <ccroy@arista.com>

---------

Signed-off-by: Christopher Croy <ccroy@arista.com>
…ic-net#22201)

orch_northbond_route_zmq_enabled was enabled in:
sonic-net#20441
which enables fpmsyncd to send route events to orchagent via a ZMQ
channel instead of Redis.
test_bgp_route_with_suppress_negative_operation kills orchagent and
restarts bgp session. After restart, fpmsyncd tries to send all updates
to orchagent via ZMQ, but orchagent is still down. Eventually, ZMQ
buffer is filled up and causes the error.
Ignore this error log as it is expected.

Signed-off-by: markxiao <markxiao@arista.com>
Signed-off-by: Austin Pham <austinpham@microsoft.com>
…c-net#21561)

What is the motivation for this PR?
Expand the current NTP test case to verify that changing the source interface configuration for NTP works correctly.

How did you do it?
Add a test case to change the NTP source interface to Loopback0, and then back to eth0.

How did you verify/test it?
Tested on T0 KVM.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
What is the motivation for this PR?
Add a test case to verify the ability to add and remove DNS nameservers via incremental config.

How did you do it?
Model this after the NTP GCU test case, and add two random DNS nameservers and remove one.

How did you verify/test it?
Tested locally on KVM T0.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
lakshmi-nexthop and others added 9 commits February 9, 2026 11:55
…System (sonic-net#20872)

What is the motivation for this PR?
sonic-mgmt tests need to be improved for FS voq
How did you do it?
access to the Chassis DB is done only if the VOQ device is a chassis
skipped test_po_voq.py for single-ASIC FS, as this test checks for additions/deletions/updates of LAGs across all ASICs in a chassis using the chassis APP DB
How did you verify/test it?
ran the tests on FS voq

Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
Co-authored-by: Xin Wang <xiwang5@microsoft.com>
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

### Description of PR

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

### Description of PR


Summary:
• Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness
• New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
  • Proactively clears systemd failure counters (systemctl reset-failed)
  • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start
  • Verifies the target container becomes running within a timeout
• Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec")

Fixes # (issue)
• MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
• Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness.

#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
  • Detects StartLimitHit pre/post restart attempts
  • Runs systemctl reset-failed to clear counters
  • Applies a fixed backoff when rate-limited, then systemctl start
  • Verifies the container is running within a configurable timeout using existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
• Local validation in lab:
  • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
  • Repeated the restart sequence to emulate rate limiting scenarios.
  • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
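The guarded-restart flow above can be sketched as follows. The signature is illustrative (the real helper is restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py); a recording stub stands in for commands executed on the DUT:

```python
import time

def restart_service_with_guard(service, run, is_rate_limited,
                               backoff=0.0, sleep=time.sleep):
    """Sketch of the StartLimitHit-safe restart flow (illustrative names)."""
    run(f"systemctl reset-failed {service}")   # proactively clear failure counters
    run(f"systemctl restart {service}")
    if is_rate_limited():                      # e.g. StartLimitHit in systemctl status
        sleep(backoff)                         # bounded backoff before retrying
        run(f"systemctl start {service}")

# Exercise the flow with a recording stub instead of a real DUT:
log = []
restart_service_with_guard("macsec", log.append, lambda: True)
print(log)
# ['systemctl reset-failed macsec', 'systemctl restart macsec', 'systemctl start macsec']
```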

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
…r handling (sonic-net#22082)

Description of PR
Summary: Fixes ethernet interface test failures due to port availability issues and routeCheck errors on multi-ASIC platforms

The below PR is also needed to fix the test completely:
PR for FEC related issue - sonic-net#21026

Type of change
- Bug fix
- Testbed and Framework(new/improvement)
- New Test case
  - Skipped for non-supported platforms
- Test case improvement

What is the motivation for this PR?
Ethernet interface tests would fail when all ethernet ports were members of PortChannels and no free ports were available for testing. Additionally, tests were failing during teardown with routeCheck errors, particularly on multi-ASIC platforms, where port configuration changes (speed, FEC, admin status) triggered temporary routing inconsistencies that were flagged by the routeCheck monitoring service.

How did you do it?
Enhanced port availability logic:
- Added get_test_port() to automatically remove a port from its PortChannel if no free ports are available
- Added remove_port_from_portchannel() helper function
- Port removal is temporary - the ensure_dut_readiness fixture's rollback mechanism automatically restores the original configuration
- Added MTU test protection: checks for free ports without removing them from their PortChannel (remove_from_portchannel=False), and skips the test if none are available, to avoid routing convergence issues

Added a loganalyzer fixture (ignore_expected_loganalyzer_exceptions_lag) to suppress expected routeCheck errors that occur during port configuration changes. The fixture ignores:
- orchagent port speed errors
- routeCheck temporary failures with unaccounted routes
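The ignore mechanism works by matching log lines against a list of regular expressions; a minimal sketch of that matching, with illustrative placeholder patterns rather than the fixture's exact expressions:

```python
import re

# Illustrative ignore patterns; the PR's fixture extends the loganalyzer's
# ignore list with expressions covering orchagent port-speed errors and
# transient routeCheck failures. These exact regexes are assumptions.
IGNORE_REGEXES = [
    r".*ERR swss#orchagent: .*Failed to set port .* speed.*",
    r".*ERR .*routeCheck.* (unaccounted|mismatch).*",
]


def is_expected(log_line):
    """Return True if the line matches an expected (ignorable) error."""
    return any(re.search(pattern, log_line) for pattern in IGNORE_REGEXES)
```

Any ERR line matching one of these patterns is excluded from the loganalyzer's failure summary, so the temporary errors during port reconfiguration no longer fail the test.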
How did you verify/test it?
- Ran all ethernet interface tests on multi-ASIC linecard platforms where all ports were in PortChannels
- Verified that tests successfully complete without routeCheck failures during teardown
- Confirmed that port removal and restoration work correctly via the rollback mechanism

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>

* Trigger CI checks

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>

* Updated conftest and removed the test-based loganalyzer

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>
Fix incorrect fallback value for serial device prefix in _get_serial_device_prefix(). Changed from /dev/ttyUSB- to /dev/ttyUSB to generate correct device paths (e.g., /dev/ttyUSB1 instead of /dev/ttyUSB-1).

What is the motivation for this PR?
When udevprefix.conf cannot be read, the fallback value /dev/ttyUSB- generates invalid device paths.

How did you do it?
Changed fallback from /dev/ttyUSB- to /dev/ttyUSB and updated related docstrings/comments.
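The effect of the fallback value is easy to see, since device paths are built by direct concatenation of prefix and index; a minimal illustration (the function here is a simplified sketch, not the code in `_get_serial_device_prefix()` itself):

```python
def build_serial_device_path(prefix, index):
    # Device paths are formed by direct concatenation, so the fallback
    # prefix must not carry a trailing separator.
    return "{}{}".format(prefix, index)

# Old fallback: "/dev/ttyUSB-" -> "/dev/ttyUSB-1" (invalid path)
# New fallback: "/dev/ttyUSB"  -> "/dev/ttyUSB1"  (valid path)
```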

How did you verify/test it?
Code review.

Signed-off-by: cliffchen <t-cliffchen+github@microsoft.com>
What is the motivation for this PR?
Enable test_po_update with ipv6-only topo

How did you do it?
Added steps for ipv6

How did you verify/test it?
Ran on ipv6-only topo testbed
What is the motivation for this PR?
To enable ipv6 only topo for test_static_route

How did you do it?
Added a check for ipv6 and used the appropriate prefix for it.

How did you verify/test it?
Tested on an ipv6-only testbed to verify
…pv6-default] since it is duplicated with skip condition (sonic-net#21759)

What is the motivation for this PR?
Remove the duplicated xfail, since it should already be covered by the skip condition.
This makes the conditions clearer, with less confusion and less maintenance.
What is the motivation for this PR?
The previous test_ipinip teardown could leave the dut in an unhealthy state after the test finishes, causing several docker containers to remain down and impacting subsequent tests.

How did you do it?
Updated the test_ipinip teardown logic to restore the dut by performing a golden config reload.

How did you verify/test it?
Ran test_ipinip and passed.
DPU config can be deployed after the switch configuration.

Minigraph deployment command has been extended:
* Deploy minigraph for both Switch and DPU:
  ./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv
* Deploy minigraph for Switch only:
  ./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv --skip-tags "dpu_config"
* Deploy minigraph for DPU only:
  ./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv --tags "dpu_config"

Signed-off-by: nmirin <nikolay.a.mirin@gmail.com>
@lolyu lolyu force-pushed the parallel_manager2 branch from 42e170d to 70897f8 Compare February 10, 2026 06:03
lolyu added 18 commits February 10, 2026 06:18
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
@lolyu lolyu force-pushed the parallel_manager2 branch from 70897f8 to 9f0c4af Compare February 10, 2026 06:30